Loan Book Design Assessment¶

Scenario: Financial Advisory Services is one of the departments supported by the Kenya Analytics Team. They have reached out to have a Loan Book designed that highlights key client metrics. Task Notes:

  • You will be required to use only Python and SQL for the below tasks
  • After deriving each metric, print out the first 5 rows of the new data frame
  • Use comments to arrange and describe your work.
  • For Tasks 1 and 3, you will be required to submit a Jupyter notebook clearly demonstrating your work on the below tasks.

Task 1: Using Python read the provided dataset and derive the following metrics:¶

The first task was to load the data and do a quick inspection of the first five rows.

Load the data¶
In [1]:
import pandas as pd
from datetime import date

#Read in the dataset
dd = pd.read_csv("~/Downloads/challenges/data/analytics_specialist_hiring_dataset_final.csv")

#View the first 5 rows of data
dd.head(5)
Out[1]:
contract_reference status start_date end_date next_contract_payment_due_date cumulative_amount_paid expected_cumulative_amount_paid nominal_contract_value deposit_amount birthdate gender l3_entity_id name expected_payment_progression
0 abc0001 Completed 2022-05-20T08:23:51.406303 2022-05-20T08:23:51.406303 2022-05-20T08:23:51.406303 0.0 0.00 0.00 0.0 1943-01-01 0 5121 Cash Sales IPP 1.000000
1 abc0002 Active 2022-05-25T13:28:49.873746 NaN 2022-08-31T18:34:01.654276 5131.0 6164.16 8171.86 2000.0 1937-01-01 1 5073 Individual loan 0.754315
2 abc0003 Active 2022-05-31T10:02:23.158972 NaN 2022-07-22T01:17:15.526819 3100.0 5273.52 6851.86 2000.0 1963-11-10 1 5072 Individual loan 0.769648
3 abc0004 Active 2022-07-05T11:49:03.801563 NaN 2022-08-18T13:34:13.983433 4350.0 6142.35 10451.86 3000.0 NaN 1 5084 Individual loan 0.587680
4 abc0005 Active 2022-05-31T06:31:25.977374 NaN 2022-07-04T13:30:36.436593 3450.0 7743.68 10271.86 2500.0 1998-05-10 1 5074 Individual loan 0.753873
a) PAR (Portfolio at Risk) Status.¶

Assuming the Finance request came in today (the day you receive the exercise), assign each client a PAR status based on their repayment progression. Add this as a new column labelled ‘PAR status’ in the dataset.

In [2]:
#Format the next_contract_payment_due_date column 
dd["next_contract_payment_due_date"] = pd.to_datetime(dd["next_contract_payment_due_date"])

#Get today's date
today = date.today()

#Add a column for today's date
dd['today'] = pd.to_datetime(today)

#Calculate the number of days before or after the next contract payment due.
#Label the column as difference
dd['difference'] = (dd['today'] - dd['next_contract_payment_due_date']).dt.days

#Format date inputs to look presentable
dd[["start_date", "end_date"]] = dd[["start_date", "end_date"]].apply(pd.to_datetime)
dd['start_date'] = dd['start_date'].dt.strftime('%d/%m/%y')
dd['end_date'] = dd['end_date'].dt.strftime('%d/%m/%y')
dd['next_contract_payment_due_date'] = dd['next_contract_payment_due_date'].dt.strftime('%d/%m/%y')
dd['today'] = dd['today'].dt.strftime('%d/%m/%y')

#Create a function to assign PAR based on the definition provided
def par(difference_in_days):
    if difference_in_days < 0:
        return "On Time"
    elif difference_in_days <= 7:
        return "PAR0-7"
    elif difference_in_days <= 30:
        return "PAR8-30"
    elif difference_in_days <= 90:
        return "PAR31-90"
    else:
        return "PAR90+"

#Apply the function
dd['PAR'] = dd['difference'].apply(par)

#Subset the data for proper viewing and quick inspection.
par_dataset = dd[["contract_reference", "next_contract_payment_due_date", "today", "difference", "PAR"]]
par_dataset.head(5)
Out[2]:
contract_reference next_contract_payment_due_date today difference PAR
0 abc0001 20/05/22 25/10/22 157 PAR90+
1 abc0002 31/08/22 25/10/22 54 PAR31-90
2 abc0003 22/07/22 25/10/22 94 PAR90+
3 abc0004 18/08/22 25/10/22 67 PAR31-90
4 abc0005 04/07/22 25/10/22 112 PAR90+
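As an aside, the same bucketing can be done without a row-wise apply by using pd.cut (a sketch, not part of the submitted solution; the bin edges below are assumptions chosen to reproduce the PAR buckets for whole-day differences):

```python
import numpy as np
import pandas as pd

# Example day differences; right-inclusive bins reproduce the PAR buckets:
# (-inf, -1] -> On Time, (-1, 7] -> PAR0-7, (7, 30] -> PAR8-30, etc.
days = pd.Series([-3, 0, 5, 12, 45, 128])
par = pd.cut(
    days,
    bins=[-np.inf, -1, 7, 30, 90, np.inf],
    labels=["On Time", "PAR0-7", "PAR8-30", "PAR31-90", "PAR90+"],
)
print(par.tolist())  # → ['On Time', 'PAR0-7', 'PAR0-7', 'PAR8-30', 'PAR31-90', 'PAR90+']
```

pd.cut scales better on large frames, while the if/elif function above is arguably easier to read; both give the same labels here.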
b) Current Collection Rate¶

Derived by taking the Cumulative Amount Paid divided by (Cumulative Expected Amount Paid - Deposit)

In [3]:
#Calculate Current collection rate
dd["current_collection_rate"] = dd["cumulative_amount_paid"]/dd["expected_cumulative_amount_paid"]

#Subset the data for proper viewing and quick inspection.
current_collection_rate = dd[["contract_reference", "cumulative_amount_paid", "expected_cumulative_amount_paid", "current_collection_rate"]]
current_collection_rate.head(5)
Out[3]:
contract_reference cumulative_amount_paid expected_cumulative_amount_paid current_collection_rate
0 abc0001 0.0 0.00 NaN
1 abc0002 5131.0 6164.16 0.832392
2 abc0003 3100.0 5273.52 0.587843
3 abc0004 4350.0 6142.35 0.708198
4 abc0005 3450.0 7743.68 0.445525
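Row 0 shows NaN because both the numerator and the denominator are zero. If downstream reports need a defined value, the ratio can be guarded (a sketch with made-up figures; treating a contract with nothing yet expected as fully collected, i.e. rate 1.0, is an assumption, not part of the task definition):

```python
import numpy as np
import pandas as pd

paid = pd.Series([0.0, 500.0])       # cumulative_amount_paid (example values)
expected = pd.Series([0.0, 1000.0])  # expected_cumulative_amount_paid

# Replace zero denominators with NaN so the division never warns,
# then fill the resulting undefined ratios with 1.0.
rate = (paid / expected.replace(0, np.nan)).fillna(1.0)
print(rate.tolist())  # → [1.0, 0.5]
```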
c) Total Amount in Arrears.¶

This is the expected amount that should have been paid by now minus what has actually been paid.

In [4]:
#Calculate total amount in arrears
dd["total_arrears"] = dd["expected_cumulative_amount_paid"]-dd["cumulative_amount_paid"]

#Subset the data for proper viewing and quick inspection.
total_arrears = dd[["contract_reference", "expected_cumulative_amount_paid", "cumulative_amount_paid", "total_arrears"]]
total_arrears.head(5)
Out[4]:
contract_reference expected_cumulative_amount_paid cumulative_amount_paid total_arrears
0 abc0001 0.00 0.0 0.00
1 abc0002 6164.16 5131.0 1033.16
2 abc0003 5273.52 3100.0 2173.52
3 abc0004 6142.35 4350.0 1792.35
4 abc0005 7743.68 3450.0 4293.68
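A negative total (a client who has paid ahead of schedule) also shows up later as the minimum of this column. If a report should only list amounts actually owed, the negatives can be floored at zero (a sketch with example figures; clipping is an assumption, not part of the task definition):

```python
import pandas as pd

expected = pd.Series([6164.16, 3000.00])  # expected_cumulative_amount_paid
paid = pd.Series([5131.00, 3864.00])      # the second client has overpaid

arrears = expected - paid
owed = arrears.clip(lower=0)  # ahead-of-schedule clients show 0 owed
print(owed.tolist())
```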
d) Total Payment Progression.¶

This is the cumulative amount paid divided by the nominal contract value.

In [5]:
#Calculate total Payment Progression
dd["payment_progression"] = dd["cumulative_amount_paid"]/dd["nominal_contract_value"]

#Subset the data for proper viewing and quick inspection.
payment_progression = dd[["contract_reference", "cumulative_amount_paid", "nominal_contract_value", "payment_progression"]]
payment_progression.head(5)
Out[5]:
contract_reference cumulative_amount_paid nominal_contract_value payment_progression
0 abc0001 0.0 0.00 NaN
1 abc0002 5131.0 8171.86 0.627886
2 abc0003 3100.0 6851.86 0.452432
3 abc0004 4350.0 10451.86 0.416194
4 abc0005 3450.0 10271.86 0.335869
e) Expected Payment Progression.¶

This is the expected cumulative amount paid divided by the nominal contract value.

In [6]:
#The Expected Payment progression
dd["expected_payment_progression"] = dd["expected_cumulative_amount_paid"]/dd["nominal_contract_value"]

#Subset the data for proper viewing and quick inspection.
expected_payment_progression = dd[["contract_reference", "expected_cumulative_amount_paid", "nominal_contract_value", "expected_payment_progression"]]
expected_payment_progression.head(5)
Out[6]:
contract_reference expected_cumulative_amount_paid nominal_contract_value expected_payment_progression
0 abc0001 0.00 0.00 NaN
1 abc0002 6164.16 8171.86 0.754315
2 abc0003 5273.52 6851.86 0.769648
3 abc0004 6142.35 10451.86 0.587680
4 abc0005 7743.68 10271.86 0.753873
f) Loan Type.¶

Any entry in the name column that contains ‘Individual’ is an Individual Loan, any entry that contains ‘Group’ is a Group Loan, any entry that contains ‘Paygo’ is a Paygo Loan and any entry that contains ‘Cash’ is a Cash Sale.

In [7]:
#Write a function to determine the loan type from the name column
def loan_type(name):
    name = name.lower()
    if 'individual' in name:
        return "Individual Loan"
    elif 'group' in name:
        return "Group Loan"
    elif 'paygo' in name:
        return "Paygo Loan"
    elif 'cash' in name:
        return "Cash Sale"
    return ""

#Apply the function
dd['loan_type'] = dd['name'].apply(loan_type)

#Subset the data for proper viewing and quick inspection.
loan_type = dd[["contract_reference", "name", "loan_type"]]
loan_type.head(5)
Out[7]:
contract_reference name loan_type
0 abc0001 Cash Sales IPP Cash Sale
1 abc0002 Individual loan Individual Loan
2 abc0003 Individual loan Individual Loan
3 abc0004 Individual loan Individual Loan
4 abc0005 Individual loan Individual Loan
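For reference, the same mapping can be vectorised and made case-insensitive with np.select (a sketch with made-up names; the "Other" fallback for unmatched names is an assumption):

```python
import numpy as np
import pandas as pd

names = pd.Series(["Cash Sales IPP", "Individual loan", "Group loan", "Paygo lease"])

conditions = [
    names.str.contains("individual", case=False),
    names.str.contains("group", case=False),
    names.str.contains("paygo", case=False),
    names.str.contains("cash", case=False),
]
choices = ["Individual Loan", "Group Loan", "Paygo Loan", "Cash Sale"]

# np.select picks the first matching condition per row, like the if/elif chain.
loan_types = np.select(conditions, choices, default="Other")
print(loan_types.tolist())  # → ['Cash Sale', 'Individual Loan', 'Group Loan', 'Paygo Loan']
```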

Task 2: Let’s assume that the provided dataset is in a relational database, e.g. Postgres or Microsoft SQL Server.¶

Write an SQL query that assigns clients PAR statuses based on repayment, similar to the Python task above. Note: this is just for the PAR status, not all of the above-derived metrics.

a) Setting up my database¶

I created a PostgreSQL database instance on AWS (free tier) so that I could verify the queries I was writing. The database also helps me present the solution more clearly.

I then used SQLAlchemy to connect the pandas API to the database.

In [8]:
#Connect to my database.
#This is a disposable free-tier instance, so sharing its credentials poses no risk.
import sqlalchemy as sa

#Set Up my credentials
engine = sa.create_engine('postgresql://kevegnjhongrsq:7a9c90ae154a234452ef3ff25d2968512d9155aced9ac757cf20adc34fdf09b7@ec2-54-243-226-219.compute-1.amazonaws.com:5432/dd7lo89ujeuuiu')

#Query my loanbook table
#I have updated my database table to contain only the columns I intend to use.
df = pd.read_sql_query('select * from loanbook limit 5;',con=engine)
df
Out[8]:
contract_reference status next_contract_payment_due_date today date_difference
0 abc0044 Active 2022-06-19 09:21:05.925129 2022-10-25 09:31:07.209581 128
1 abc0045 Active 2022-08-26 02:16:24.213503 2022-10-25 09:31:07.209581 60
2 abc0046 Active 2022-10-25 06:23:21.975138 2022-10-25 09:31:07.209581 0
3 abc0047 Active 2022-10-22 09:10:06.427091 2022-10-25 09:31:07.209581 3
4 abc0048 Active 2022-10-24 04:30:48.873412 2022-10-25 09:31:07.209581 1
b) Add a computed column to get "today's date"¶

The next step was to alter the table and add a column holding a timestamp of today's date.

/*Add today column */
ALTER TABLE loanbook ADD COLUMN today TIMESTAMP DEFAULT NOW();
In [9]:
#Query my loanbook table to check results of above query.
df = pd.read_sql_query('select contract_reference, status,next_contract_payment_due_date,today from loanbook limit 5;',con=engine)
df
Out[9]:
contract_reference status next_contract_payment_due_date today
0 abc0044 Active 2022-06-19 09:21:05.925129 2022-10-25 09:31:07.209581
1 abc0045 Active 2022-08-26 02:16:24.213503 2022-10-25 09:31:07.209581
2 abc0046 Active 2022-10-25 06:23:21.975138 2022-10-25 09:31:07.209581
3 abc0047 Active 2022-10-22 09:10:06.427091 2022-10-25 09:31:07.209581
4 abc0048 Active 2022-10-24 04:30:48.873412 2022-10-25 09:31:07.209581
c) Add a computed column for the difference in days between the current date and the next contract payment due date.¶

This was achieved by first adding a date_difference column to the loanbook table, using the command below.

/* Add a date_difference column */
ALTER TABLE loanbook ADD date_difference INTEGER;

The next step was to compute the actual date difference.

/* Difference in days between today and the next contract payment due date */
UPDATE loanbook SET date_difference = EXTRACT(DAY FROM today - next_contract_payment_due_date);
In [10]:
#Query my loanbook table to check results of above query.
df = pd.read_sql_query('select contract_reference, status,next_contract_payment_due_date,today,date_difference from loanbook limit 5;',con=engine)
df
Out[10]:
contract_reference status next_contract_payment_due_date today date_difference
0 abc0044 Active 2022-06-19 09:21:05.925129 2022-10-25 09:31:07.209581 128
1 abc0045 Active 2022-08-26 02:16:24.213503 2022-10-25 09:31:07.209581 60
2 abc0046 Active 2022-10-25 06:23:21.975138 2022-10-25 09:31:07.209581 0
3 abc0047 Active 2022-10-22 09:10:06.427091 2022-10-25 09:31:07.209581 3
4 abc0048 Active 2022-10-24 04:30:48.873412 2022-10-25 09:31:07.209581 1
d) Final querying and computation of PAR¶

Finally, I used SELECT with a CASE expression to compute PAR.

/* Calculate PAR */
select contract_reference,status, loanbook.date_difference,
       (case when date_difference < 0 then 'On Time'
             when date_difference >= 0 and date_difference <= 7 then 'PAR0-7'
             when date_difference >= 8 and date_difference <= 30 then 'PAR8-30'
             when date_difference >= 31 and date_difference <= 90 then 'PAR31-90'
             when date_difference >= 91 then 'PAR90+'
             else 'something else'
        end) as PAR FROM loanbook;
In [11]:
#Final query to calculate PAR.
df = pd.read_sql_query(""" select contract_reference,status, loanbook.date_difference,
       (case when date_difference < 0 then 'On Time'
             when date_difference >= 0 and date_difference <= 7 then 'PAR0-7'
             when date_difference >= 8 and date_difference <= 30 then 'PAR8-30'
             when date_difference >= 31 and date_difference <= 90 then 'PAR31-90'
             when date_difference >= 91 then 'PAR90+'
             else 'something else'
        end) as PAR FROM loanbook;
""",con=engine)
df.head(5)
Out[11]:
contract_reference status date_difference par
0 abc0044 Active 128 PAR90+
1 abc0045 Active 60 PAR31-90
2 abc0046 Active 0 PAR0-7
3 abc0047 Active 3 PAR0-7
4 abc0048 Active 1 PAR0-7
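The CASE logic can also be sanity-checked locally without the hosted database, for example against an in-memory SQLite table (a sketch; the rows are hand-entered from the output above, and BETWEEN is just a tidier spelling of the paired comparisons):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loanbook (contract_reference TEXT, date_difference INTEGER)")
conn.executemany(
    "INSERT INTO loanbook VALUES (?, ?)",
    [("abc0044", 128), ("abc0045", 60), ("abc0046", 0), ("abc0047", 3)],
)

rows = conn.execute("""
    SELECT contract_reference,
           CASE WHEN date_difference < 0 THEN 'On Time'
                WHEN date_difference BETWEEN 0 AND 7 THEN 'PAR0-7'
                WHEN date_difference BETWEEN 8 AND 30 THEN 'PAR8-30'
                WHEN date_difference BETWEEN 31 AND 90 THEN 'PAR31-90'
                ELSE 'PAR90+'
           END AS par
    FROM loanbook
""").fetchall()
print(rows)
```

The results agree with the Out[11] table above: 128 days late is PAR90+, 60 days is PAR31-90, and 0 and 3 days both fall in PAR0-7.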

Task 3: Using what you have derived above and the provided dataset, perform Exploratory Data Analysis in Python and share any key insights.¶

a) Get the most popular loan type.¶

This was achieved by counting loans of each type. Group loans were the most popular (56%), followed by individual loans (35%).

Kindly note that the chart is interactive, and you can download a PNG file of it to embed in a report.

In [12]:
#Plot Pie chart - Plotly
import plotly.express as px
dd3  = dd.groupby(['loan_type']).size().reset_index(name  = 'counts')
dd3 = dd3.sort_values(by = ["counts"],ascending=False)

#Draw plotly
fig =px.pie(dd3, values='counts', names='loan_type',hole=.3)
fig.show()
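The 56% / 35% split quoted above can be reproduced from the per-status counts shown in the next section (a sketch; the totals below are aggregated by hand from that table):

```python
import pandas as pd

# Loan-type totals, summed by hand from the status breakdown in the next section.
counts = pd.Series(
    {"Group Loan": 563, "Individual Loan": 352, "Cash Sale": 84, "Paygo Loan": 1}
)
shares = (counts / counts.sum()).round(3)
print(shares.to_dict())  # → {'Group Loan': 0.563, 'Individual Loan': 0.352, 'Cash Sale': 0.084, 'Paygo Loan': 0.001}
```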
b) Get the most popular loans per status.¶

Group loans were the most common active loan type, followed by individual loans. Group loans also led in completed loans, closely followed by cash sales.

In [13]:
#Count loans by loan type and status
dd2  = dd.groupby(['status', 'loan_type']).size().reset_index(name  = 'counts')
#View the raw data
dd2
Out[13]:
status loan_type counts
0 Active Group Loan 462
1 Active Individual Loan 292
2 Completed Cash Sale 84
3 Completed Group Loan 100
4 Completed Individual Loan 60
5 Completed Paygo Loan 1
6 Defaulted Group Loan 1

Plotting the most popular loans per status using a grouped bar chart.

Kindly note that the chart is interactive, and you can download a PNG file of it to embed in a report.

In [14]:
import plotly.graph_objects as go

#Pivot so each loan type has a count for every status (0 where absent),
#which keeps each trace's y values aligned with the x-axis categories.
pivot = dd2.pivot(index='status', columns='loan_type', values='counts').fillna(0)

fig = go.Figure(data=[
    go.Bar(name=loan, x=pivot.index, y=pivot[loan])
    for loan in ["Group Loan", "Individual Loan", "Cash Sale", "Paygo Loan"]
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()
c) Top arrears per loan type.¶

Get the loan type with the most arrears. Group loans had the most arrears, at 1,297,484.

In [15]:
top_arrears_per_loan_type = dd.groupby(['loan_type'])['total_arrears'].sum().reset_index()
top_arrears_per_loan_type = top_arrears_per_loan_type.sort_values(by = ["total_arrears"],ascending=False)
top_arrears_per_loan_type
Out[15]:
loan_type total_arrears
1 Group Loan 1297484.26
2 Individual Loan 579104.54
0 Cash Sale 0.00
3 Paygo Loan -864.00
d) Top borrowers - individuals with the highest arrears.¶

The individual with contract reference abc0856 had the highest arrears, at 16,556.

In [16]:
top_5_highest_borrowers = dd.groupby(['contract_reference'])['total_arrears'].sum().reset_index()
top_5_highest_borrowers = top_5_highest_borrowers.sort_values(by = ["total_arrears"],ascending=False)
top_5_highest_borrowers.head(5)
Out[16]:
contract_reference total_arrears
855 abc0856 16556.14
640 abc0641 15769.96
469 abc0470 14392.79
885 abc0886 14106.22
387 abc0388 13398.98
e) Average total arrears.¶

The mean of total arrears was 1,875. Total arrears also has a peculiar distribution, with the first quartile being 0.

In [17]:
dd['total_arrears'].describe()
Out[17]:
count     1000.000000
mean      1875.724800
std       3053.248041
min     -14396.000000
25%          0.000000
50%        796.450000
75%       3149.250000
max      16556.140000
Name: total_arrears, dtype: float64
f) Arrears per region.¶

The region with l3_entity_id 5121 had the highest total arrears, at 1,774,090. This represents about 95% of all total arrears.

In [18]:
top_5_highest_region = dd.groupby(['l3_entity_id'])['total_arrears'].sum().reset_index()
top_5_highest_region = top_5_highest_region.sort_values(by = ["total_arrears"],ascending=False)
top_5_highest_region = top_5_highest_region.head(5)
top_5_highest_region
Out[18]:
l3_entity_id total_arrears
9 5121 1774090.16
7 5084 30259.73
2 5074 15314.20
3 5076 14317.78
1 5073 12732.97

Kindly note that the chart is interactive, and you can download a PNG file of it to embed in a report.

In [19]:
#Draw plotly
fig =px.pie(top_5_highest_region, values='total_arrears', names='l3_entity_id',hole=.3)
fig.show()